Re: Zedstore - compressed in-core columnar storage - Mailing list pgsql-hackers
From | DEV_OPS |
Subject | Re: Zedstore - compressed in-core columnar storage |
Date | |
Msg-id | 99fefcc5-8320-a5d1-5d6d-cbe339a2f151@ww-it.cn |
In response to | Re: Zedstore - compressed in-core columnar storage (Ashwin Agrawal <aagrawal@pivotal.io>) |
List | pgsql-hackers |
It's really cool and very good progress. I'm interested in whether SIMD/JIT will be supported.

best wishes

TY

On 2019/5/23 08:07, Ashwin Agrawal wrote:
> We (Heikki, me and Melanie) are continuing to build Zedstore, and wish
> to share the recent additions and modifications. Attaching a patch with
> the latest code; link to the github branch [1] to follow along. The
> approach we have been leaning towards is to build the required
> functionality, get the tests passing, and then continue to iterate to
> optimize. It's still work-in-progress.
>
> Sharing the details now, as we have reached our next milestone for
> Zedstore. All table AM APIs are implemented for Zedstore (except
> compute_xid_horizon_for_tuples, which seems to need a test first).
>
> Current state:
>
> - A new type of item has been added to Zedstore, the "array item", to
>   boost compression and performance, based on Konstantin's performance
>   experiments [2] and input from Tomas Vondra [3]. An array item holds
>   multiple datums with consecutive TIDs and the same visibility
>   information. An array item saves space compared to multiple single
>   items by leaving out the repetitive UNDO and TID fields. An array
>   item cannot mix NULLs and non-NULLs. So, those experiments should
>   show improved performance now. Inserting data via COPY currently
>   creates array items; the insert code path has not been modified since
>   last time. Making singleton inserts and INSERT ... SELECT performant
>   is still on the todo list.
>
> - We now have a separate, dedicated meta-column btree alongside the
>   rest of the data column btrees. This special, always-present first
>   btree is used to assign TIDs to tuples and to track the UNDO
>   location that provides visibility information. It also helps support
>   zero-column tables (which can result from ADD COLUMN / DROP COLUMN
>   actions as well). Plus, storing meta-data separately from data helps
>   achieve better compression ratios.
>   It also further simplifies the overall design/implementation:
>   deletes only need to edit the meta-column and avoid touching the
>   actual data btrees. Index scans can perform visibility checks based
>   on this meta-column alone and fetch datums only for visible tuples.
>   Tuple locks likewise only need to access this meta-column.
>   Previously, every column btree carried the same undo pointer, so
>   with the past layout a visibility check could potentially be
>   performed using any column; but considering the overall
>   simplification the new layout provides, it's fine to give up that
>   property. The dedicated meta-column greatly simplified handling of
>   ADD COLUMN with default and null values, as this column
>   deterministically provides all the TIDs present in the table, which
>   can't be said for any data column due to default or null values
>   during ADD COLUMN.
>
> - Free Page Map implemented. The Free Page Map keeps track of unused
>   pages in the relation. The FPM is also a b-tree, indexed by physical
>   block number. To be more compact, it stores "extents", i.e. block
>   ranges, rather than individual blocks, when possible. An interesting
>   paper [4] on how modern filesystems manage space was a good source
>   of ideas.
>
> - Tuple locks implemented.
>
> - Serializable isolation handled.
>
> - With "default_table_access_method=zedstore":
>   - 31 out of 194 regress tests failing
>   - 10 out of 86 isolation tests failing
>   Many of the currently failing tests are due to plan differences,
>   e.g. Index scans being selected for zedstore over Index Only scans,
>   as zedstore doesn't yet have a visibility map. I have yet to give
>   thought to index-only scans. Other plan diffs are due to table size
>   differences between heap and zedstore.
>
> Next few milestones we wish to hit for Zedstore:
> - Make check regress green
> - Make check isolation green
> - Make Zedstore crash safe (which also means replication safe):
>   implement WAL logging
> - Performance profiling and optimizations for inserts, selects, index
>   scans, etc.
> - Once the UNDO framework lands upstream, make Zedstore leverage it
>   instead of its own version of UNDO
>
> Open questions / discussion items:
>
> - How best to get the "column projection list" from the planner?
>   (Currently, we walk the plan and find the columns required for the
>   query in the executor; refer to GetNeededColumnsForNode().)
>
> - How to pass the "column projection list" to the table AM? (As stated
>   in the initial email, we have currently modified the table AM API to
>   pass the projection to the AM.)
>
> - TIDs are treated as (block, offset) in the current indexing code.
>
> - Physical tlist optimization? (Currently disabled for zedstore.)
>
> Team:
> Melanie has joined Heikki and me to write code for zedstore. The
> majority of the code continues to be contributed by Heikki. We are
> continuing to have fun building a column store implementation and
> iterating aggressively.
>
> References:
> [1] https://github.com/greenplum-db/postgres/tree/zedstore
> [2] https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
> [3] https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
> [4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf